Reinforcement Planning: Planners as Policies

Authors

  • Matt Zucker
  • Andrew Bagnell

Abstract

Introduction. State-of-the-art robotic systems [1, 2, 3] increasingly rely on search-based planning or optimal control methods to guide decision making. Similar observations can be made about computer game engines. Such methods are nearly always extremely crude approximations to the reality encountered by the robot: they consider a simplified model of the robot (as a point, or a “flying brick”), they often model the world deterministically, and they nearly always optimize a surrogate cost function chosen to induce the correct behavior rather than the “true” reward function corresponding to a declarative task description. Such approximations are made because they enable efficient, real-time solutions. Despite this crudeness, optimal control methods have proven quite valuable because of their ability to transfer knowledge to new domains; given a way to map features of the world to a cost function, we can compute a plan that navigates a robot in a never-before-visited part of the world. While value-function methods and reactive policies are popular in the reinforcement learning community, it often proves remarkably difficult to transfer the ability to solve a particular problem to related ones using such methods [4]. Planning methods, by contrast, consider a sequence of decisions into the future, and rely on the principle underlying optimal control that cost functions are more parsimonious and generalizable than plans or values. However, planners are only successful to the extent that they can transfer domain knowledge to novel situations. Most of the human effort involved in getting systems to work with planners stems from the tedious and error-prone task of adjusting surrogate cost functions, which has until recently been a black art. Imitation learning by inverse optimal control, using, e.g., the Learning to Search approach [3], has proven to be an effective method for automating this adjustment; however, it is limited by human expertise.
Our approach, Reinforcement Planning (RP), is straightforward: we demonstrate that crude planning algorithms can be learned as part of a direct policy search or value function approximator, thereby allowing a system to experiment on its own and enabling it to outperform human expert demonstration. It is important in this work to distinguish between the “true” reward function r(x), true state (observation) space X, and control space A defined by the problem domain, and the surrogate states and costs used to generate plans, S and c(s). We rely on an implicit map that takes any x to a corresponding s; this function should be thought of as a many-to-one coarse-graining. We note that planning algorithms have been used extensively in previous RL work (e.g. Dyna [5]); our work contrasts with these in embracing the reality that real-world planners operate on a coarse approximation and must be trained to have a cost function that induces the correct behavior, not the “true” cost function.

Value-Function Subgradient. To enable RP, we rely on the fact that the value of a state s has a subgradient with respect to the parameters of the planner’s cost function. The state-action cost for an optimal planner is given by a function c(s, a, θ) that takes as input a state-action pair as well as a parameter vector θ. The optimal value V(s, θ) of a state is then given by V(s, θ) = min_{ξ(s)} Σ_{i ∈ ξ(s)} c(s_i, a_i, θ), where ξ(s) ranges over trajectories of state-action pairs starting at s. A subgradient of V(s, θ) with respect to the cost parameters θ is simply the gradient at the minimizer ξ*(s): ∇_θ V(s, θ) = Σ_{i ∈ ξ*(s)} ∇_θ c(s_i, a_i, θ). This result enables us to extend any gradient-based RL algorithm to leverage optimal control methods.
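The subgradient above can be sketched concretely. For a surrogate cost that is linear in the parameters, c(s, a, θ) = θ·φ(s, a), the planner reduces to a shortest-path search and ∇_θ V(s, θ) is just the sum of feature vectors along the optimal path. The graph, feature dictionary, and function name below are illustrative assumptions, not part of the paper:

```python
import heapq
import numpy as np

def plan_and_subgradient(graph, features, theta, start, goal):
    """Dijkstra search with edge cost c(u, v, theta) = theta . phi(u, v).

    Returns the optimal value V(start, theta) and its subgradient w.r.t.
    theta, i.e. the summed feature vectors along the minimizing path
    (since grad_theta of theta . phi is just phi for each edge).
    """
    dist = {start: 0.0}
    parent = {}
    pq = [(0.0, start)]  # priority queue of (cost-so-far, node)
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v in graph[u]:
            nd = d + float(theta @ features[(u, v)])  # c(u, v, theta)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(pq, (nd, v))
    # Subgradient: accumulate feature vectors along the optimal path.
    grad = np.zeros_like(theta)
    node = goal
    while node != start:
        grad += features[(parent[node], node)]
        node = parent[node]
    return dist[goal], grad
```

Because the cost is linear in θ, a quick sanity check is that θ·∇_θV(s, θ) recovers V(s, θ) exactly; the returned gradient can then drive any gradient-based update of θ, as the abstract describes.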


Similar Articles

Optimizing Player Experience in Interactive Narrative Planning: A Modular Reinforcement Learning Approach

Recent years have witnessed growing interest in data-driven approaches to interactive narrative planning and drama management. Reinforcement learning techniques show particular promise because they can automatically induce and refine models for tailoring game events by optimizing reward functions that explicitly encode interactive narrative experiences’ quality. Due to the inherently subjective...


Intelligent Cooperative Control Architecture: A Framework for Performance Improvement Using Safe Learning

Planning for multi-agent systems such as task assignment for teams of limited-fuel unmanned aerial vehicles (UAVs) is challenging due to uncertainties in the assumed models and the very large size of the planning space. Researchers have developed fast cooperative planners based on simple models (e.g., linear and deterministic dynamics), yet inaccuracies in assumed models will impact the resulti...


Balancing Learning and Engagement in Game-Based Learning Environments with Multi-objective Reinforcement Learning

Game-based learning environments create rich learning experiences that are both effective and engaging. Recent years have seen growing interest in data-driven techniques for tutorial planning, which dynamically personalize learning experiences by providing hints, feedback, and problem scenarios at runtime. In game-based learning environments, tutorial planners are designed to adapt gameplay eve...


PRM-RL: Long-range Robotic Navigation Tasks by Combining Reinforcement Learning and Sampling-based Planning

We present PRM-RL, a hierarchical method for long-range navigation task completion that combines sampling-based path planning with reinforcement learning (RL) agents. The RL agents learn short-range, point-to-point navigation policies that capture robot dynamics and task constraints without knowledge of the large-scale topology, while the sampling-based planners provide an approximate map of the...


Forward and Bidirectional Planning Based on Reinforcement Learning and Neural Networks in a Simulated Robot

Building intelligent systems that are capable of learning, acting reactively and planning actions before their execution is a major goal of artificial intelligence. This paper presents two reactive and planning systems that contain important novelties with respect to previous neural-network planners and reinforcement-learning based planners: (a) the introduction of a new component (“matcher”) a...


A Planning Modular Neural-network Robot for Asynchronous Multi-goal Navigation Tasks

This paper focuses on two planning neural-network controllers, a "forward planner" and a "bidirectional planner". These have been developed within the framework of Sutton's Dyna-PI architectures (planning within reinforcement learning) and have already been presented in previous papers. The novelty of this paper is that the architecture of these planners is made modular in some of its component...



Journal:

Volume   Issue

Pages   -

Publication date: 2010